| station | valid | lat | lon | elevation | tmpf | dwpf | relh | drct | sknt | p01i | alti | mslp | vsby | gust | skyc1 | skyc2 | skyc3 | skyc4 | skyl1 | skyl2 | skyl3 | skyl4 | wxcodes | ice_accretion_1hr | ice_accretion_3hr | ice_accretion_6hr | peak_wind_gust | peak_wind_drct | peak_wind_time | feel | snowdepth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | cat | cat | cat | cat | f32 | f32 | f32 | f32 | str | f32 | f32 | f32 | f32 | f32 | datetime[μs] | f32 | f32 |
| "EHA" | 2014-09-16 18:00:00 | 37.000801 | -101.879997 | 1099.0 | 74.5 | 57.386665 | 56.404446 | 195.0 | 10.611111 | 0.0 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 74.543892 | null |
| "19S" | 2012-09-12 12:00:00 | 37.496899 | -100.832901 | 892.570007 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "3K3" | 2016-07-30 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 78.900002 | 64.5 | 61.701668 | 117.777779 | 7.777778 | 0.0 | 30.060556 | null | 10.0 | 14.0 | null | null | null | null | 5730.769043 | 7000.0 | 8500.0 | null | null | null | null | null | null | null | null | 80.116112 | null |
| "GCK" | 2011-10-11 12:00:00 | 37.927502 | -100.724403 | 881.0 | 51.200001 | 50.228573 | 96.458572 | 161.428574 | 8.0 | null | 29.835714 | 1008.880005 | 6.571429 | null | null | null | null | null | 15500.0 | null | null | null | null | null | null | null | null | null | null | 50.134285 | null |
| "GCK" | 2018-04-22 12:00:00 | 37.927502 | -100.724403 | 881.0 | 37.5 | 36.5 | 96.377502 | 211.25 | 7.625 | 0.0 | 30.202499 | 1023.56665 | 6.125 | null | null | null | null | null | 766.666687 | 2000.0 | 3500.0 | null | null | null | null | null | null | null | null | 31.23875 | null |
Predictive Modeling of Weather Station Data:
Linear Regression vs. Graph Neural Network
Slides: slides.html
Introduction
Accurate weather prediction is a crucial task with widespread implications across agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations; however, with the advancement of graph neural networks (GNNs), we believe there is potential in a more modern deep learning approach (Lam et al. 2023; Keisler 2022).
In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data (Herzmann 2023). Our aim is to evaluate whether the GNN’s ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques.
The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time, and can be represented as a node within a broader spatial network. For the linear model baseline, a single model will be trained using all stations’ data aggregated per feature for each time step.
For the GNN, the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models (Li et al. 2018).
Our evaluation focuses on forecasting performance over the last six months of the dataset. We assess how well each modeling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.
Methodology
1. Cleaning Process
The raw weather station data underwent a multistage cleaning and preprocessing procedure designed to ensure temporal consistency, handle missing values, and prepare the data for both linear and GNN-based models. The steps are as follows:
Original Dataset: The dataset has shape (time, feature), with one record per station at each time step, so each time step contributes 8 records, one per node.
Temporal Alignment: Missing time steps were filled for all stations to ensure uniform time coverage across the selected date range.
Temporal Compression: Data was downsampled from 1-hour intervals to 6-hour intervals to reduce noise and improve model efficiency.
Feature Pruning: Features missing more than 10% of their values in over 6 out of 8 stations were removed.
Station Filtering: Stations missing more than two years of valid data were excluded to ensure consistency across nodes (some stations were newer than others).
Correlation Analysis: Performed both between-station and within-station correlation analysis to better understand the spatial and temporal dependencies among features.
Feature Scaling: Remaining features were scaled using appropriate normalization techniques (e.g., min-max, robust scaling) to facilitate model training.
Data Reshaping: The final dataset was transformed into a 3D array of shape (time, station, feature), serving as the unified input format for both the linear and GNN models.
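The alignment, compression, and reshaping steps above can be sketched as follows. This is a minimal pandas/NumPy illustration rather than the project's actual code; the column names `station` and `valid` follow the dataset, while the function name is an assumption.

```python
import numpy as np
import pandas as pd

def to_tensor(df: pd.DataFrame, features: list[str]) -> np.ndarray:
    """Downsample hourly records to 6-hour means and reshape the result
    into a (time, station, feature) array with one row per station."""
    # Temporal compression: 6-hour mean per station
    down = (df.set_index("valid")
              .groupby("station")[features]
              .resample("6h").mean()
              .reset_index())
    # Temporal alignment: materialize every (station, time) combination,
    # leaving NaN where a station has no record for that step
    stations = sorted(down["station"].unique())
    times = pd.date_range(down["valid"].min(), down["valid"].max(), freq="6h")
    idx = pd.MultiIndex.from_product([stations, times], names=["station", "valid"])
    down = down.set_index(["station", "valid"]).reindex(idx)
    # Reshape (station, time, feature) -> (time, station, feature)
    arr = down[features].to_numpy().reshape(len(stations), len(times), len(features))
    return arr.transpose(1, 0, 2)
```

Both models then consume slices of this single array, which keeps the two pipelines directly comparable.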
2. Correlation Analysis
The correlation analysis is done to minimize data leakage: many weather features are calculated directly from temperature, and such features could let the model predict temperature accurately without learning anything useful during training. Two kinds of correlation analysis were therefore performed.
Intra-node correlation
For the intra-node correlation, each node was broken into its own dataset and all features within the node were compared against each other in a standard correlation analysis. This was done to identify offending features for removal, as data leakage is highly likely in this kind of machine learning task.
Inter-node correlation
For the inter-node correlation, each feature was broken into its own dataset and, within the feature, all nodes were compared against each other in a standard correlation analysis. This was done to gauge the importance of spatial information: if every node is strongly correlated with the others, there is little to gain from viewing the problem as a graph network.
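Both analyses operate on the same (time, station, feature) array; a minimal NumPy sketch (the function name and shapes are assumptions):

```python
import numpy as np

def correlation_matrices(data: np.ndarray):
    """data: (time, station, feature) array.
    Returns one feature-by-feature matrix per station (features compared
    within a station) and one station-by-station matrix per feature
    (the same feature compared across stations)."""
    T, S, F = data.shape
    # all features of a single station compared against each other
    within = [np.corrcoef(data[:, s, :], rowvar=False) for s in range(S)]
    # a single feature compared across all stations
    across = [np.corrcoef(data[:, :, f], rowvar=False) for f in range(F)]
    return within, across
```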
3. Feature Imputation
Imputation was incredibly important for this task: the data comes from real-world weather sensors and is plagued with quality issues, including significant chunks of missing values, so a robust imputation technique was required. Since the goal of this project is to compare a GNN against a linear model, it does not make sense to use a GNN for feature imputation. Instead, a two-stage technique was applied.
Spatial Imputation
For the spatial imputation step, each time step was analyzed, and any missing values were estimated from the neighboring nodes and the graph edge weights. This allows the imputation process to consider only spatial relationships, which is more powerful than relying purely on temporal imputation.
Temporal Imputation
In situations where all nodes are missing a feature at the same time, the spatial imputation step is impossible, so a more naive temporal imputation was required. Since there is little information to go on and we did not wish to use a model for imputation, the remaining gaps were simply interpolated across the missing time steps.
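The two stages can be sketched as follows. This is a simplified NumPy illustration of the described scheme, assuming missing values are NaN and that the spatial estimate is an edge-weighted mean of observed neighbors; the project's exact formula may differ.

```python
import numpy as np

def impute(data: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """data: (time, station, feature) with NaNs; adj: (station, station) weights."""
    out = data.copy()
    T, S, F = out.shape
    # Stage 1: spatial -- weighted mean of observed neighbors at the same step
    for t in range(T):
        for f in range(F):
            col = out[t, :, f]
            miss = np.isnan(col)
            if miss.any() and not miss.all():
                w = adj[miss][:, ~miss]          # weights to observed neighbors
                denom = w.sum(axis=1)
                fill = w @ col[~miss]
                col[miss] = np.where(denom > 0, fill / np.maximum(denom, 1e-12), np.nan)
    # Stage 2: temporal -- linear interpolation along time, per station
    for s in range(S):
        for f in range(F):
            col = out[:, s, f]
            miss = np.isnan(col)
            if miss.any() and not miss.all():
                idx = np.arange(T)
                col[miss] = np.interp(idx[miss], idx[~miss], col[~miss])
    return out
```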
4. GNN
The GNN is structured to model the spatiotemporal dynamics of the weather station network. It is implemented in PyTorch, with an architecture inspired by the Diffusion Convolutional Recurrent Neural Network (DCRNN) (Li et al. 2018).
Graph Structure and Input Format
Graph Representation: Each weather station is represented as a node, and spatial relationships are encoded in a dense adjacency matrix (excluding self-loops)
Data Format: The input is formatted using the StaticGraphTemporalSignal class from torch-geometric-temporal, which supports the temporal sequences on a static graph.
Train/Validation/Test Split:
Testing: Final 6 months of data (730 time steps at 6-hour resolution)
Training: First 80% of the remaining data
Validation: Final 20% of the remaining data
Model Architecture
The model consists of three stacked DCRNN layers followed by a linear projection layer:
DCRNN(140, 64) -> ReLU -> DCRNN(64, 32) -> ReLU -> DCRNN(32, 32) -> ReLU -> Linear(32, 1)
DCRNN Layers: Capture both temporal patterns as well as spatial diffusion through the graph structure.
ReLU Activations: Introduce nonlinearity after each recurrent layer.
Linear Output: Maps the final hidden state to the predicted temperature at the next time step
Training Configuration:
Optimizer: Adam
Learning Rate: Initial rate of 0.01, reduced by a factor of 0.1 upon plateau, with a minimum of 10^-5
Epochs: Trained for up to 100 epochs with early stopping based on validation loss plateau
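The optimizer, plateau schedule, and early stopping can be sketched in plain PyTorch. The `train_step`/`val_step` callables are hypothetical stand-ins for the actual epoch iteration over the temporal graph signal.

```python
import torch

def fit(model, train_step, val_step, max_epochs=100, patience=10):
    """Adam with plateau-based LR decay (factor 0.1, floor 1e-5) and
    early stopping on the validation loss, as configured above."""
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, min_lr=1e-5)
    best, best_state, bad = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss = train_step(model)   # one pass over the training signal
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val = float(val_step(model))
        sched.step(val)            # cut the LR when validation loss plateaus
        if val < best - 1e-6:
            best, bad = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad += 1
            if bad >= patience:    # early stopping
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```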
The GNN uses the preceding 28 time steps to forecast the next temperature value. Unlike the linear model, it is equipped to capture both temporal trends and spatial interdependencies.
5. Linear Model
The linear baseline model was designed as a univariate time-series regression task. It uses the preceding 28 time steps (equivalent to 7 days) to predict the target variable tmpf at the next time step.
Model Input:
At each time step t, the input vector aggregates the five base features across all stations:
\[ \widehat{\text{tmpf}}_{t+1} = \mathbf{w}^{\top}\left[\text{features}_{t};\ \text{features}_{t-1};\ \dots;\ \text{features}_{t-27}\right] + b \]
Where:
\[ \text{features}_t = \{\text{tmpf},\ \text{relh},\ \text{sknt},\ \text{drct}_{\text{sin}},\ \text{drct}_{\text{cos}}\} \]
- tmpf: Temperature
- relh: Relative Humidity
- sknt: Wind Speed
- \(\text{drct}_{\text{sin}}, \text{drct}_{\text{cos}}\): Wind direction encoded as sine and cosine components
This flattened representation allows a standard linear regression model to be trained on a fixed-length vector for each prediction target, discarding the graph structure of the weather stations. Flattening the stations into a single point removes all spatial information from the model; however, given the close proximity of these nodes, this has little to no impact on the final result.
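Constructing the flattened design matrix and fitting the baseline can be sketched as follows (NumPy/scikit-learn; the function name `make_windows` and the aggregated input array are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_windows(data: np.ndarray, lookback: int = 28, target_col: int = 0):
    """data: (time, feature) array of station-aggregated features, with
    tmpf in column `target_col`. Returns flattened lookback windows X
    and the next-step temperature y."""
    X, y = [], []
    for t in range(lookback, len(data)):
        X.append(data[t - lookback:t].ravel())  # e.g. 28 steps x 5 features = 140 dims
        y.append(data[t, target_col])
    return np.asarray(X), np.asarray(y)

# usage sketch: fit on the training portion, evaluate on the final 6 months
# X, y = make_windows(aggregated)
# model = LinearRegression().fit(X[:split], y[:split])
```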
The linear model uses the same training/test split as the GNN excluding the validation set.
Analysis and Results
Data Exploration and Visualization
1. Data Source:
The dataset used in this project was sourced from the Iowa Environmental Mesonet (IEM) hosted by Iowa State University (Herzmann 2023). The data follows observational standards set by the FAA Automated Surface Observing System (ASOS) (Administration 2021). For this project, we selected a subset covering a 10-year period from 2010 to 2020, focusing on nine weather stations located in southwestern Kansas. These stations were chosen based on geographic proximity and consistency of data availability.
2. Data Structure
The original dataset contains:
33 features
8 stations
96,408 hourly time steps
With intermittent missing values across both time and stations
A subset of key features relevant to this project is summarized below:
| Feature | Description |
|---|---|
| station | Station identifier code (3-4 characters) |
| valid | Timestamp of the observation |
| lon | Longitude |
| lat | Latitude |
| elevation | Elevation in meters |
| tmpf | Air temperature (F) |
| relh | Relative humidity (%) |
| drct | Wind direction (degrees) |
| sknt | Wind speed (knots) |
| p01i | Precipitation (inches) over the previous hour |
| vsby | Visibility (miles) |
As well as a slice of the unprocessed csv data:
3. Exploratory Data Analysis (EDA)
Initial exploratory analysis focused on filtering out low-quality features and stations as well as general reduction in the dimensionality of the dataset:
Features with more than 10% missing values were removed.
Stations with excessive missing data during the 2010-2020 window were dropped. One station was excluded entirely as it was introduced after 2020, and thus had no data within the selected range.
The remaining dataset was evaluated to ensure sufficient temporal coverage and consistency across stations and features.
The visual below shows all features that meet these conditions for varying time slices, as well as a 0/1 flag for whether the station is valid within the time slice.
From this visual, a date range from 2018 to 2020 was selected, as it had the most valid features and stations while also being quite recent. With this range the ULS station was dropped, leaving seven valid stations and six valid features.
Below is a slice of the dataset at this stage of the EDA process:
| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f64 | f64 |
| "GCK" | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| "LBL" | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| "EHA" | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| "HQG" | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| "3K3" | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "EHA" | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| "HQG" | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| "3K3" | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| "JHN" | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| "19S" | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |
4. Graph Creation
To prepare the dataset for graph-based modeling, a spatial graph was constructed:
Each station was treated as a node.
A dense adjacency matrix (excluding self-connections) was created by computing geodesic distances between stations.
Edge weights were defined as the inverse of the geodesic distance, scaled to a [0, 1] range using min-max scaling; the closer two stations are, the stronger their connection in the graph.
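A sketch of this construction, using a haversine approximation in place of a true geodesic distance; the function name and Earth-radius constant are assumptions.

```python
import numpy as np

def build_adjacency(lat: np.ndarray, lon: np.ndarray) -> np.ndarray:
    """Dense adjacency from inverse pairwise distance, min-max scaled
    to [0, 1], with zeros on the diagonal (no self-loops)."""
    R = 6371.0  # Earth radius, km
    phi, lam = np.radians(lat), np.radians(lon)
    dphi = phi[:, None] - phi[None, :]
    dlam = lam[:, None] - lam[None, :]
    a = np.sin(dphi / 2) ** 2 + np.cos(phi)[:, None] * np.cos(phi)[None, :] * np.sin(dlam / 2) ** 2
    dist = 2 * R * np.arcsin(np.sqrt(a))          # haversine distance matrix
    with np.errstate(divide="ignore"):
        inv = np.where(dist > 0, 1.0 / dist, 0.0)  # inverse distance, no self-loops
    off = inv[~np.eye(len(lat), dtype=bool)]       # off-diagonal entries only
    lo, hi = off.min(), off.max()
    return np.where(dist > 0, (inv - lo) / (hi - lo), 0.0)
```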
Below is a visual of the full graph structure:
5. Spatiotemporal Imputation
Missing values were imputed through a two-stage process leveraging both spatial and temporal structure:
Spatial Imputation: Each missing value was estimated based on the value of neighboring nodes within the same time step, weighted by graph connectivity.
Temporal Imputation: Remaining gaps were filled by interpolating along the time axis for each node individually.
While not a perfect method, this approach produced plausible and continuous data, as visually confirmed during quality checks as shown below.
Below is an example of the data requiring both spatial and temporal imputation
Below is the same data post imputation:
6. Correlation Analysis
To avoid feature redundancy and data leakage, a correlation analysis was conducted:
Intra-node and inter-node correlations were computed.
Intra-node correlation finds relationships between features within a node.
Inter-node correlation explores the importance of spatial information by comparing the same feature across nodes.
Two features, dwpf (dew point in f) and feel (feels-like temperature in f), showed high correlation with the target variable tmpf (temperature in f). Since these variables are partially derived from the target variable tmpf, they were removed to maintain model integrity and avoid leakage.
Below shows the intra-node correlation:
Below shows the inter-node correlation:
7. Final Preparation
After all preprocessing steps, the final dataset was reduced and standardized:
5 features were retained: tmpf, relh, sknt, drct_sin, drct_cos
7 stations remained after filtering
4,381 time steps at 6-hour intervals (equivalent to the 3 years of data from 2018 to 2020) remained
The remaining unscaled features were scaled using the RobustScaler from scikit-learn to mitigate the influence of outliers while preserving the overall data distribution, resulting in the array below.
shape: (30_667, 6)
┌─────────┬───────────┬──────────┬───────────┬───────────┬───────────┐
│ station ┆ tmpf ┆ relh ┆ sknt ┆ drct_sin ┆ drct_cos │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═══════════╪══════════╪═══════════╪═══════════╪═══════════╡
│ GCK ┆ -1.355721 ┆ 0.481483 ┆ -0.080357 ┆ 0.851117 ┆ 0.524977 │
│ LBL ┆ -1.265174 ┆ 0.52015 ┆ 0.160714 ┆ 0.664796 ┆ 0.747025 │
│ EHA ┆ -1.168491 ┆ 0.633828 ┆ -0.285714 ┆ 0.970763 ┆ 0.24004 │
│ HQG ┆ -1.205638 ┆ 0.484683 ┆ -0.223214 ┆ 0.936332 ┆ 0.351115 │
│ 3K3 ┆ -1.241791 ┆ 0.531278 ┆ -0.383929 ┆ 0.981255 ┆ 0.192712 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ EHA ┆ -0.368491 ┆ 0.349556 ┆ -0.991072 ┆ -0.533205 ┆ -0.845986 │
│ HQG ┆ -0.417247 ┆ 0.322311 ┆ -1.205357 ┆ 0.977334 ┆ -0.211704 │
│ 3K3 ┆ -0.432836 ┆ 0.313778 ┆ -0.741072 ┆ -0.700217 ┆ -0.71393 │
│ JHN ┆ -0.417579 ┆ 0.385544 ┆ -0.705357 ┆ -0.824675 ┆ -0.565607 │
│ 19S ┆ -0.441128 ┆ 0.365522 ┆ -0.491072 ┆ -0.824675 ┆ 0.565607 │
└─────────┴───────────┴──────────┴───────────┴───────────┴───────────┘
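As a small illustration of the scaling step, scikit-learn's `RobustScaler` centers each feature on its median and divides by the interquartile range, so a single outlier barely shifts the scaled values (the numbers below are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical unscaled feature columns (e.g. tmpf, relh); the outlier in
# the first column has little effect on the median/IQR used for scaling.
X = np.array([[9.3, 48.1], [12.3, 52.0], [15.6, 63.4], [14.3, 48.5], [100.0, 53.1]])
scaled = RobustScaler().fit_transform(X)
# each scaled column now has median 0
```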
Modeling and Results
Model 1: Graph Neural Network (DCRNN)
The Graph Neural Network was trained using the previous 28 time steps (equivalent to 7 days) and leveraged a dense spatial graph connecting all stations. This structure enabled the model to learn both temporal sequences and spatial diffusion patterns across the weather station network.
Graph Structure: Dense graph with edge weights based on inverse geodesic distance
Architecture: Three stacked DCRNN layers + ReLU activations + Linear projection
Target: Next-step temperature prediction for each node
Results:
MSE: 0.0562
Visually, the model follows the temporal trends of the weather data, and its predictions appear reasonably accurate. However, there is a noticeable lag in the predictions, and a maximum absolute error of 1 is quite extreme given that the scaled test features range from roughly -1.5 to 1.5.
Model 2: Linear Regression
The linear baseline model was trained using the same 28-time-step history with aggregated weather station data to predict the next time step’s temperature. All features were flattened into a single vector, treating the problem as a high-dimensional regression task with no spatial awareness.
Strengths: Simplicity, interpretability, low training time
Weakness: Cannot leverage inter-station relationships or dynamic spatial trends.
MSE: 0.0147
Visually, the model performs considerably better than the GNN, with little lag in the predicted results.
Its maximum absolute error of 0.8 is also noticeably lower than the GNN's 1.0.
Comparing Errors
When the two models are compared per station, it becomes apparent how poorly the GNN performs on this predictive task. Overall, the linear model shows a lower average error as well as more consistent results. Despite being unable to consider spatial information, the linear model wins here, suggesting that spatial information may not actually be important for this dataset.